L4: Practical loss-based stepsize adaptation for deep learning

Neural Information Processing Systems

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by conclusively improving the performance of Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including dense nets, CNNs, ResNets, and the recurrent Differentiable Neural Computer on the classical datasets MNIST, Fashion-MNIST, CIFAR-10 and others.



Reviews: L4: Practical loss-based stepsize adaptation for deep learning

Neural Information Processing Systems

The paper proposes a scheme for adaptively choosing the learning rate for stochastic gradient descent and its variants. The key idea is very simple and easy to implement: given L_min, the loss value at the global minimum, choose the learning rate eta such that the update along the gradient reaches L_min from the current point, i.e. solve L(theta - eta*v) = L_min for eta, where v is, for example, dL/dtheta in gradient descent. Finally, to make the adaptive learning rate pessimistic with respect to possible linearization error, the authors introduce a coefficient alpha, so the effective learning rate used by the optimizer is eta*alpha. The authors show empirically (on badly conditioned regression, MNIST, CIFAR-10, and the neural computer) that such an adaptive scheme helps in two ways: 1. optimization performance is less sensitive to the choice of the coefficient alpha than to the choice of learning rate in the non-adaptive setting, and 2. the optimizer reduces the loss faster than, or at worst as fast as, commonly used optimizers. At the same time, the paper has some shortcomings as admitted by the authors: 1.
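The stepsize rule summarized in this review can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the equation is solved under a first-order approximation, L(theta - eta*v) ≈ L(theta) - eta * <g, v> with g = dL/dtheta, which gives eta = (L(theta) - L_min) / <g, v>. The function names, the small `eps` guard, the toy quadratic problem, and the value alpha = 0.15 are all assumptions chosen for illustration.

```python
def linearized_stepsize(loss, grad, direction, loss_min=0.0, alpha=0.15, eps=1e-12):
    """Return alpha * eta, where eta solves the linearized L(theta - eta*v) = L_min.

    Hypothetical helper: the name, eps guard, and default alpha are assumptions.
    """
    # First-order model: L(theta - eta*v) ~= loss - eta * <grad, direction>
    slope = sum(g * v for g, v in zip(grad, direction))
    eta = (loss - loss_min) / (slope + eps)
    # alpha < 1 keeps the step pessimistic about linearization error
    return alpha * eta


def quadratic_loss(theta, scales):
    # Toy loss 0.5 * sum(s_i * theta_i^2); its global minimum is L_min = 0
    return 0.5 * sum(s * t * t for s, t in zip(scales, theta))


def quadratic_grad(theta, scales):
    return [s * t for s, t in zip(scales, theta)]


def run(steps=50, alpha=0.15):
    scales = [1.0, 100.0]  # badly conditioned toy problem (assumed for the demo)
    theta = [1.0, 1.0]
    losses = [quadratic_loss(theta, scales)]
    for _ in range(steps):
        loss = quadratic_loss(theta, scales)
        g = quadratic_grad(theta, scales)
        # Plain gradient descent: the update direction v is the gradient itself
        eta = linearized_stepsize(loss, g, g, loss_min=0.0, alpha=alpha)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
        losses.append(quadratic_loss(theta, scales))
    return losses


if __name__ == "__main__":
    history = run()
    print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.6f}")
```

In this sketch the update direction is the raw gradient. For Adam or Momentum variants, `direction` would instead be the update vector produced by that optimizer, which is how the review's "v is for example dL/dtheta" reads.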


L4: Practical loss-based stepsize adaptation for deep learning

Neural Information Processing Systems

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by conclusively improving the performance of Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including dense nets, CNNs, ResNets, and the recurrent Differentiable Neural Computer on the classical datasets MNIST, Fashion-MNIST, CIFAR-10 and others. Papers published at the Neural Information Processing Systems Conference.


L4: Practical loss-based stepsize adaptation for deep learning

arXiv.org Machine Learning

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by strongly improving the performance of Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including ResNets and the Differentiable Neural Computer. A prototype implementation as a TensorFlow optimizer is released.